Travelling with a Huge Silver Cigar: A Study on Subtitle and Script Reliability for Indexing Video Material[1]
Abstract
Television companies generally keep documentary and archive centres which allow editors to retrieve video sequences to be integrated into a new production. The disclosure of video material in these centres has always been very time-consuming and expensive: time-consuming because it had to be done manually, and expensive because it had to be done by experts (librarians, documentalists). Any editor has to use a database to gain access to the contents of video material; that is, each search for a single image or for image sequences relies on manually disclosed material. This study, therefore, investigates the degree to which video archive systems can utilize the semantic link between image and text as a key for retrieving video fragments for new productions. Our analysis, furthermore, takes a closer look at the assumption that there is a direct correlation between image and text. In order to verify and to support this premise, we focus on an empirical assessment of subtitle and script reliability as a basis for the indexing of video material.

Introduction

Video material can be retrieved in two different ways, depending on the application: querying by means of a sample image, as in face recognition, or querying by natural language phrases or just keywords, as is common in TV archives. Our focus is on the latter possibility.

Almost all television companies keep documentary and archive centres which allow editors to retrieve video sequences needed for a new production. The disclosure[2] of video material in these centres has always been very time-consuming and expensive: time-consuming because it had to be done manually, and expensive because it had to be done by experts (librarians, documentalists). Any editor has to use a database to gain access to the contents of video material; that is, each search for a single image or for image sequences relies on manually disclosed material.

One possible access to the visual content of any film production can be gained by way of the subtitles and/or the film scripts[3]. If one takes into account that text retrieval technologies have been successfully developed over the last decades, subtitles and film scripts become the key to very cheap and high-quality access to the content of video material. This study, therefore, can be seen as a proof of concept which uses text indexing as an approach to cataloguing the corresponding video material automatically. On the technical level, script and subtitle texts are linked to video sequences with the help of the time code. On the level of content, the semantic relationship between text and image has been conceived by the film producer.

[1] This work was carried out within the project Pop-Eye, funded by the European Union (Language Engineering LE-4234). More information on this project can be found at http://twentyone.tpd.tno.nl/popeye/
[2] The term disclosure refers either to (1) the process of assigning keywords which describe the content of any given video material, performed by a documentalist, or to (2) the whole automatic process ranging from video digitizing to any kind of automatic indexing process.
[3] A film script is a text containing the exact voice-over of a film production. It can either be the text which was used as a basis for production, or the transcript which was made after broadcasting, for example, for legal purposes.
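To make the time-code linkage concrete, the following is a minimal sketch, not the Pop-Eye implementation, of how SRT-style subtitle entries can be parsed into time-stamped text segments that could serve as index entries; the field layout and names are our own illustrative assumptions.

```python
import re
from dataclasses import dataclass

# One subtitle entry: the time code is the technical link to the video.
@dataclass
class SubtitleSegment:
    start: float  # seconds from the start of the video
    end: float
    text: str

_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def _to_seconds(stamp: str) -> float:
    h, m, s, ms = map(int, _TIME.match(stamp).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(content: str) -> list[SubtitleSegment]:
    """Parse SRT-style subtitles into time-coded text segments."""
    segments = []
    for block in content.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, _, end = lines[1].partition(" --> ")
        segments.append(SubtitleSegment(
            start=_to_seconds(start.strip()),
            end=_to_seconds(end.strip()),
            text=" ".join(lines[2:]),
        ))
    return segments
```

Each segment's [start, end] interval then identifies the video frames that a text query on the subtitle words would retrieve.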
Our study investigates to what degree this already fixed semantic link, that is, the link defined by the producer, can be utilized to retrieve images with the help of textual references. The semantic relationship between image and text is established by way of combining a specific image with a specific text (Muckenhaupt 1986). The meaning of an image shifts as soon as it is combined with a new textual sequence: image A combined with text A signifies differently than image A combined with text B. Any given text will thus allow for a spectrum of possible visualizations. Hence, we cannot necessarily assume that there is a clear-cut, unambiguous correlation between text and image. Not everything that is said in the text will be shown in the image, or vice versa.

This study, therefore, is interested in investigating the degree to which video archive systems can utilize the semantic link between image and text as a key for retrieving video fragments for new productions. Our analysis, furthermore, takes a closer look at the assumption that there is a direct correlation between image and text. In order to verify and to support this premise, we focus on an empirical assessment of subtitle and script reliability as a basis for the indexing of video material. That is, this study aims at a thorough analysis of subtitle and script reliability as a key for image queries and for the retrieval of images by using associated subtitle texts.

To illustrate the goals of this study, we will look at an example of video segments and corresponding subtitles. Figure 1 shows a storyboard of stills representing the first frame for each subtitle. Our analysis is concerned with the relationship between the text in the subtitles and the content of the images. The term milieu-box, for example, occurs in a subtitle where the corresponding image does not show a milieu-box, but the frame for the next subtitle shows such a green box. (We obviously focused our analysis on the continuous video material rather than on these images representing a selection of frames on the basis of the subtitle time codes.)[4]

Fig. 1. Storyboard of stills from the video De Milieubox (© VRT, Brussels, Belgium)

[4] This storyboard was made with the Pop-Eye software prototype, which processes the text contained in the subtitles to create an index of the subtitled video sequences. This prototype allows for automatic disclosure and direct search of digitized video material (de Jong et al., 2000).

Semantic classification of links

Our analysis of subtitle and script reliability intends to establish the degree to which one can assume a direct correlation between a certain textual phrase and the corresponding image. In order to structure this relationship between texts and images, we studied two types of classifications:

a. classifications of the textual aspects: semantic classifications used in lexical semantics;
b. classifications of the movies' aspects: the structure of the film language or code.

Semantic classifications which are commonly used in lexical semantics are ontologies such as the EuroWordNet Base Concepts and Top Ontology (Vossen, 1998). Within this European project, a top ontology of basic semantic distinctions has been constructed which classifies words within three general categories:

1. the 1stOrderEntity, which always includes concrete nouns, like „individual persons, animals and more or less discrete physical objects“;
2. the 2ndOrderEntity, which can include nouns, verbs and adjectives indicating „events, processes, state-of-affairs and situations which can be located in time“;
3. the 3rdOrderEntity, which always includes abstract nouns, like „ideas, thoughts, theories, hypotheses, that exist outside of space and time and which are unobservable“ (Vossen, 1998).

In his semiotic analysis of the visual film codes, Eco (1972) argues that the perceptual recognition of objects, chains of actions and abstract terms within the image occurs on the basis of the iconic, the kinetic and the rhetorical code. Objects can be recognized on a denotative level with the help of the iconic code. Figurae, signs and semantemes produce signifying units within the image which allow us to perceive an object and/or a person, like for example a man, a box or a chair.[5] According to Eco, we perceive or recognize human gestures and human acts on the basis of kinemorphemes.[6] The temporal succession of kinemorphemes within the visual space of the film image produces codified movements and gestures which can then be read as a sequence of successive actions.[7] The three-dimensional movement(s) produced by a specific kinetic syntax within the image can be correlated with events and actions on the textual level. Abstract terms are analogous to Eco's discussion of the rhetorical code. Both the iconographic and the rhetorical code mark visual semantemes which have become more complex and culturally conventionalized on a connotative level. Elements of a cultural ideology (Eco, 1972) can thus be codified within any given image. That is, iconographic semantemes, with the help of rhetorical codes, account for abstract terms within the image.[8]

[5] Eco also argues that iconic signs, that is, semantemes, cannot simply be seen as the equivalent of a single word, but always have to be understood as a complex utterance. The image of a horse does not signify „horse“ but „white horse is here, standing, in profile“ (Eco 1972, 368).
[6] The semiotic field of kinetics, in general, attempts to reveal how human behavior and movement is codified within a sign system.
[7] Eco defines kine as the smallest possible units of movement which can still incorporate an autonomous meaning. Kine are thus similar to semantemes because both represent the smallest possible units of signification (Eco 1972, 373).
[8] The image of a single man, for example, who walks down an alley, away from the viewer, incorporates the connotative component of „loneliness“ (Eco 1972, 371).

We consequently came to the conclusion that we needed to investigate three types of relations:

1. phrases denoting objects and persons which are shown in the image;
2. abstract terms which are mentioned in the text, and for which the images show good visualizations;
3. events and actions which are reported in the text and shown in the image.

As we compared these relationships with a study on enquiries addressed to a TV archive, we discovered an exact correspondence with the coarse-grain classes established in Weber (1992). This coarse-grain classification[9] subsumes the queries within the following classes:

1. abstract queries;
2. queries focusing on specific objects and/or persons;
3. queries related to specific events and actions.

[9] Weber's fine-grain classification of research queries is irrelevant to our study. The execution of his fine-grain classification model was not clear enough and, therefore, does not add anything to our investigation.

The material we used for our analysis proved at an early stage of the investigation that the chosen categories are crucial for disclosing and retrieving video materials.
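One plausible way to encode these three relation types for bookkeeping during annotation is sketched below; the type and field names are our own illustrative choices, not taken from the study.

```python
from dataclasses import dataclass
from enum import Enum

class LinkCategory(Enum):
    """The three relation types between a textual phrase and the image."""
    OBJECT_PERSON = "object/person"  # concrete objects and persons shown
    EVENT_ACTION = "event/action"    # events and actions reported and shown
    ABSTRACT = "abstract"            # abstract terms with good visualizations

@dataclass
class ShotLink:
    """One counted correlation between a textual phrase and a film shot."""
    shot_id: int
    phrase: str
    categories: set[LinkCategory]  # a shot may be retrievable via several categories

# Hypothetical example: a shot showing the green milieu-box could be retrieved
# both via the object itself and via the action performed with it.
link = ShotLink(shot_id=42, phrase="milieu-box",
                categories={LinkCategory.OBJECT_PERSON, LinkCategory.EVENT_ACTION})
```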
We encountered, for example, quite a lot of possible visualizations of abstract terms, like hunger, pain, poverty and fear. Events and actions were shown and commented on not only in feature films, but also in documentaries.

Basic units of this investigation

We assumed that textual phrases would allow us to retrieve (correlating) images, at least to a certain degree. Our working premise did not presuppose a 1:1 correspondence between the visual image and the given text. We were aware of the fact that not everything visible in the image will reappear in the subtitle or script text.

We counted any correlation between the textual phrase (subtitle or script text) and the image per shot. Our study thus used film shots as the basic unit of analysis. Thereby, we utilized the time code as a means of cataloguing the text/image correspondence, instead of taking textual phrases as a point of departure. This approach takes into account that editors usually look for visual footage, that is, film images, and not for textual phrases as a basis for their new production. The editor's search for specific quotes, like for example John F. Kennedy's statement „Ich bin ein Berliner“ or the Belgian prime minister's recurring statement „No comment“, can be classified as an exceptional query because it focuses on both the textual and the visual content; that is, the quote and its specific visualization are of equal interest.

Our analysis of the correlation between image and text focused on the total amount of used footage. We were not interested in assessing the degree to which text retrieval will allow the optimal reuse of a single video. That is, each search for a specific image is aimed at any given footage, disregarding the remaining scenes of the video unit. If one searches for an image of the “Mannheimer Wasserturm” (the water tower in Mannheim), for example, it is relevant to find a series of images out of the total sum of digitized film material, independently of the reusability of other scenes in the video.

In the process of our film analysis, we counted each shot that we could link to the text.[10] Images relevant for the textual phrase were counted according to our classification of object/person, event/action and abstract terms. We also took into account that certain shots could be retrieved via more than one category. An image retrieved by the object/person category can, for example, also be retrieved by the category of event/action.

[10] This work was carried out by a team of four researchers. Each researcher collected the relevant empirical data on his own; only unclear correspondences between text and image were classified by the team.

Time Code Tolerance

We applied a certain degree of tolerance with respect to the film time code. Shot sequences related by a common content were included in our shot count; we decided to count a spectrum of shots as long as they were related to the textual statement. We discovered that some sentences in the text commented on a sequence of shots. We therefore counted previous and subsequent shots relevant to the textual sequence, as long as we could recognize a direct correlation between a film segment and a given textual phrase. In the next example, a clown can be seen in all marked stills, whereas the word clowns only occurs in the subtitle for the still in the middle of the storyboard. In this case, all the corresponding shots were counted as shots showing the clown.

Fig. 2. Example for Time Code Tolerance (from the documentary Dokus en Spruitje, © VRT, Brussels, Belgium)
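A minimal sketch of this counting rule: starting from the shot whose subtitle matched the phrase, neighbouring shots are included as long as a relevance judgment (here a stubbed-in predicate standing in for the researchers' manual decision) confirms that they still show the referenced content. The function and parameter names are illustrative assumptions.

```python
from typing import Callable

def shots_with_tolerance(
    shots: list[tuple[float, float]],       # (start, end) of each shot, in order
    hit_index: int,                         # shot whose subtitle matched the phrase
    still_relevant: Callable[[int], bool],  # manual judgment: does shot i show the content?
) -> list[int]:
    """Count the matched shot plus adjacent shots that still show the content."""
    counted = [hit_index]
    # walk backwards over previous shots
    i = hit_index - 1
    while i >= 0 and still_relevant(i):
        counted.insert(0, i)
        i -= 1
    # walk forwards over subsequent shots
    i = hit_index + 1
    while i < len(shots) and still_relevant(i):
        counted.append(i)
        i += 1
    return counted
```

In the clown example, the word clowns matches only the middle shot, but the relevance judgment holds for the neighbouring shots as well, so all of them are counted as shots showing the clown.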
Film Corpus

In the process of our video analysis, we used a broad range of broadcast material.[11] Our material included Dutch and German film footage. The Dutch material was added at a later point in time, which accounts for the much smaller range of catalogued data. Taken together, we accumulated a total of 21:19:51 hours of film, with 11,417 shots in total.[12] We tried to consider various kinds of production categories, such as feature films, political TV reports, educational programs, human-interest reports and documentaries, and political documentaries.[13] The majority of our video material did not use subtitles; we, therefore, had to work with the film scripts themselves. In the case of 100 Deutsche Jahre (© SWF, Baden-Baden, Germany), we also analysed the given stage directions as a separate category of interest.[15]

Table 1 shows the amount of material for each programme category and subject matter.

Table 1. Amount of material per programme category

Film category                total length   nb. of shots   subtitles?   language
human interest documentary   8:50:37        3604           yes          Dutch
society documentary          4:41:11        3198           no           German
sports documentary           0:28:34        351            no           German
tourism documentary          1:19:32        789            no           German
feature film                 3:24:0…        …              yes          Dutch
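As a quick back-of-the-envelope check on the corpus figures quoted above (21:19:51 of footage, 11,417 shots), the implied average shot length is roughly 6.7 seconds:

```python
# Corpus totals as reported in the Film Corpus section.
total_seconds = 21 * 3600 + 19 * 60 + 51  # 21:19:51 -> 76,791 s
shot_count = 11_417

print(total_seconds / shot_count)  # ≈ 6.73 seconds per shot on average
```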